{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Language Detection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Polyglot depends on [pycld2](https://pypi.python.org/pypi/pycld2/) library which in turn depends on [cld2](https://code.google.com/p/cld2/) library for detecting language(s) used in plain text." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from polyglot.detect import Detector" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "arabic_text = u\"\"\"\n", "أفاد مصدر امني في قيادة عمليات صلاح الدين في العراق بأن \" القوات الامنية تتوقف لليوم\n", "الثالث على التوالي عن التقدم الى داخل مدينة تكريت بسبب\n", "انتشار قناصي التنظيم الذي يطلق على نفسه اسم \"الدولة الاسلامية\" والعبوات الناسفة\n", "والمنازل المفخخة والانتحاريين، فضلا عن ان القوات الامنية تنتظر وصول تعزيزات اضافية \".\n", "\"\"\"\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "name: Arabic code: ar confidence: 99.0 read bytes: 907\n" ] } ], "source": [ "detector = Detector(arabic_text)\n", "print(detector.language)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Mixed Text" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "mixed_text = u\"\"\"\n", "China (simplified Chinese: 中国; traditional Chinese: 中國),\n", "officially the People's Republic of China (PRC), is a sovereign state located in East Asia.\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the text contains snippets from different languages, the detector is able to find the most probable langauges used in the text.\n", "For each language, we can query the model confidence level:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "name: English code: en confidence: 87.0 read bytes: 1154\n", "name: Chinese code: zh_Hant confidence: 5.0 read bytes: 1755\n", "name: un code: un confidence: 0.0 read bytes: 0\n" ] } ], "source": [ "for language in Detector(mixed_text).languages:\n", " print(language)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To take a closer look, we can inspect the text line by line, notice that the confidence in the detection went down for the first line" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "China (simplified Chinese: 中国; traditional Chinese: 中國),\n", "\n", "name: English code: en confidence: 71.0 read bytes: 887\n", "name: Chinese code: zh_Hant confidence: 11.0 read bytes: 1755\n", "name: un code: un confidence: 0.0 read bytes: 0\n", "\n", "\n", "officially the People's Republic of China (PRC), is a sovereign state located in East Asia.\n", "\n", "name: English code: en confidence: 98.0 read bytes: 1291\n", "name: un code: un confidence: 0.0 read bytes: 0\n", "name: un code: un confidence: 0.0 read bytes: 0\n", "\n", "\n" ] } ], "source": [ "for line in mixed_text.strip().splitlines():\n", " print(line + u\"\\n\")\n", " for language in Detector(line).languages:\n", " print(language)\n", " print(\"\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 
"## Best Effort Strategy\n", "\n", "Sometimes, there is no enough text to make a decision, like detecting a language from one word.\n", "This forces the detector to switch to a best effort strategy, a warning will be thrown and the attribute `reliable` will be set to `False`." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:polyglot.detect.base:Detector is not able to detect the language reliably.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Prediction is reliable: False\n", "Language 1: name: English code: en confidence: 85.0 read bytes: 1194\n", "Language 2: name: un code: un confidence: 0.0 read bytes: 0\n", "Language 3: name: un code: un confidence: 0.0 read bytes: 0\n" ] } ], "source": [ "detector = Detector(\"pizza\")\n", "print(detector)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In case, that the detection is not reliable even when we are using the best effort strategy, an exception `UnknownLanguage` will be thrown." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "ename": "UnknownLanguage", "evalue": "Try passing a longer snippet of text", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mUnknownLanguage\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[1;32mprint\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mDetector\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"4\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;32m/usr/local/lib/python2.7/dist-packages/polyglot-15.04.17-py2.7.egg/polyglot/detect/base.pyc\u001b[0m in \u001b[0;36m__init__\u001b[1;34m(self, text, quiet)\u001b[0m\n\u001b[0;32m 63\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mquiet\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mquiet\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 64\u001b[0m \u001b[1;34m\"\"\"If true, exceptions will be silenced.\"\"\"\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 65\u001b[1;33m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mdetect\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mtext\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 66\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 67\u001b[0m \u001b[1;33m@\u001b[0m\u001b[0mstaticmethod\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m/usr/local/lib/python2.7/dist-packages/polyglot-15.04.17-py2.7.egg/polyglot/detect/base.pyc\u001b[0m in \u001b[0;36mdetect\u001b[1;34m(self, text)\u001b[0m\n\u001b[0;32m 89\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 90\u001b[0m \u001b[1;32mif\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[0mreliable\u001b[0m \u001b[1;32mand\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mquiet\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 91\u001b[1;33m \u001b[1;32mraise\u001b[0m \u001b[0mUnknownLanguage\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"Try passing a longer snippet of text\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 92\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 93\u001b[0m 
{ "cell_type": "markdown", "metadata": {}, "source": [ "If the detection is not reliable even with the best effort strategy, an `UnknownLanguage` exception will be raised." ] },
{ "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "ename": "UnknownLanguage", "evalue": "Try passing a longer snippet of text", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mUnknownLanguage\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[1;32mprint\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mDetector\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"4\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;32m/usr/local/lib/python2.7/dist-packages/polyglot-15.04.17-py2.7.egg/polyglot/detect/base.pyc\u001b[0m in \u001b[0;36m__init__\u001b[1;34m(self, text, quiet)\u001b[0m\n\u001b[0;32m 63\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mquiet\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mquiet\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 64\u001b[0m \u001b[1;34m\"\"\"If true, exceptions will be silenced.\"\"\"\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 65\u001b[1;33m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mdetect\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mtext\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 66\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 67\u001b[0m \u001b[1;33m@\u001b[0m\u001b[0mstaticmethod\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m/usr/local/lib/python2.7/dist-packages/polyglot-15.04.17-py2.7.egg/polyglot/detect/base.pyc\u001b[0m in \u001b[0;36mdetect\u001b[1;34m(self, text)\u001b[0m\n\u001b[0;32m 89\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 90\u001b[0m \u001b[1;32mif\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[0mreliable\u001b[0m \u001b[1;32mand\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mquiet\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 91\u001b[1;33m \u001b[1;32mraise\u001b[0m \u001b[0mUnknownLanguage\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"Try passing a longer snippet of text\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 92\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 93\u001b[0m \u001b[0mlogger\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mwarning\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"Detector is not able to detect the language reliably.\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;31mUnknownLanguage\u001b[0m: Try passing a longer snippet of text" ] } ], "source": [ "print(Detector(\"4\"))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Such an exception may not be desirable, especially for trivial cases like single characters that could belong to many languages.\n", "In this case, we can silence the exceptions by setting `quiet` to `True`:" ] },
{ "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:polyglot.detect.base:Detector is not able to detect the language reliably.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Prediction is reliable: False\n", "Language 1: name: un code: un confidence: 0.0 read bytes: 0\n", "Language 2: name: un code: un confidence: 0.0 read bytes: 0\n", "Language 3: name: un code: un confidence: 0.0 read bytes: 0\n" ] } ], "source": [ "print(Detector(\"4\", quiet=True))" ] },
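{ "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, the exception can be caught explicitly instead of being silenced. The sketch below assumes that `UnknownLanguage` is importable from `polyglot.detect.base`, the module shown in the traceback above." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# A minimal sketch: catch the exception instead of silencing it.\n", "# Assumes UnknownLanguage lives in polyglot.detect.base, as the\n", "# traceback above suggests.\n", "from polyglot.detect.base import UnknownLanguage\n", "\n", "try:\n", "    print(Detector(\"4\"))\n", "except UnknownLanguage as error:\n", "    print(\"Detection failed: \" + str(error))" ] },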
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Command Line" ] },
{ "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "usage: polyglot detect [-h] [--input [INPUT [INPUT ...]]]\r\n", "\r\n", "optional arguments:\r\n", " -h, --help show this help message and exit\r\n", " --input [INPUT [INPUT ...]]\r\n" ] } ], "source": [ "!polyglot detect --help" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The subcommand `detect` tries to identify the language code for each line in a text file.\n", "This can be convenient if each line represents a document or a sentence, as might be produced by a tokenizer." ] },
{ "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "English Australia posted a World Cup record total of 417-6 as they beat Afghanistan by 275 runs.\r\n", "English David Warner hit 178 off 133 balls, Steve Smith scored 95 while Glenn Maxwell struck 88 in 39 deliveries in the Pool A encounter in Perth.\r\n", "English Afghanistan were then dismissed for 142, with Mitchell Johnson and Mitchell Starc taking six wickets between them.\r\n", "English Australia's score surpassed the 413-5 India made against Bermuda in 2007.\r\n", "English It continues the pattern of bat dominating ball in this tournament as the third 400 plus score achieved in the pool stages, following South Africa's 408-5 and 411-4 against West Indies and Ireland respectively.\r\n", "English The winning margin beats the 257-run amount by which India beat Bermuda in Port of Spain in 2007, which was equalled five days ago by South Africa in their victory over West Indies in Sydney.\r\n" ] } ], "source": [ "!polyglot detect --input testdata/cricket.txt" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Supported Languages\n", "\n", "cld2 can detect up to 165 languages. The list below contains more entries than that because several languages appear under more than one name or script (Chinese, for example, appears multiple times)." ] },
{ "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [
" 1. Abkhazian 2. Afar 3. Afrikaans \n", " 4. Akan 5. Albanian 6. Amharic \n", " 7. Arabic 8. Armenian 9. Assamese \n", " 10. Aymara 11. Azerbaijani 12. Bashkir \n", " 13. Basque 14. Belarusian 15. Bengali \n", " 16. Bihari 17. Bislama 18. Bosnian \n", " 19. Breton 20. Bulgarian 21. Burmese \n", " 22. Catalan 23. Cebuano 24. Cherokee \n", " 25. Nyanja 26. Corsican 27. Croatian \n", " 28. Croatian 29. Czech 30. Chinese \n", " 31. Chinese 32. Chinese 33. Chinese \n", " 34. Chineset 35. Chineset 36. Chineset \n", " 37. Chineset 38. Chineset 39. Chineset \n", " 40. Danish 41. Dhivehi 42. Dutch \n", " 43. Dzongkha 44. English 45. Esperanto \n", " 46. Estonian 47. Ewe 48. Faroese \n", " 49. Fijian 50. Finnish 51. French \n", " 52. Frisian 53. Ga 54. Galician \n", " 55. Ganda 56. Georgian 57. German \n", " 58. Greek 59. Greenlandic 60. Guarani \n", " 61. Gujarati 62. Haitian_creole 63. Hausa \n", " 64. Hawaiian 65. Hebrew 66. Hebrew \n", " 67. Hindi 68. Hmong 69. Hungarian \n", " 70. Icelandic 71. Igbo 72. Indonesian \n", " 73. Interlingua 74. Interlingue 75. Inuktitut \n", " 76. Inupiak 77. Irish 78. Italian \n", " 79. Ignore 80. Javanese 81. Javanese \n", " 82. Japanese 83. Kannada 84. Kashmiri \n", " 85. Kazakh 86. Khasi 87. Khmer \n", " 88. Kinyarwanda 89. Krio 90. Kurdish \n", " 91. Kyrgyz 92. Korean 93. Laothian \n", " 94. Latin 95. Latvian 96. Limbu \n", " 97. Limbu 98. Limbu 99. Lingala \n", "100. Lithuanian 101. Lozi 102. Luba_lulua \n", "103. Luo_kenya_and_tanzania 104. Luxembourgish 105. Macedonian \n", "106. Malagasy 107. Malay 108. Malayalam \n", "109. Maltese 110. Manx 111. Maori \n", "112. Marathi 113. Mauritian_creole 114. Romanian \n", "115. Mongolian 116. Montenegrin 117. Montenegrin \n", "118. Montenegrin 119. Montenegrin 120. Nauru \n", "121. Ndebele 122. Nepali 123. Newari \n", "124. Norwegian 125. Norwegian 126. Norwegian_n \n", "127. Nyanja 128. Occitan 129. Oriya \n", "130. Oromo 131. Ossetian 132. Pampanga \n", "133. Pashto 134. Pedi 135. Persian \n", "136. Polish 137. Portuguese 138. Punjabi \n", "139. Quechua 140. Rajasthani 141. Rhaeto_romance \n", "142. Romanian 143. Rundi 144. Russian \n", "145. Samoan 146. Sango 147. Sanskrit \n", "148. Scots 149. Scots_gaelic 150. Serbian \n", "151. Serbian 152. Seselwa 153. Seselwa \n", "154. Sesotho 155. Shona 156. Sindhi \n", "157. Sinhalese 158. Siswant 159. Slovak \n", "160. Slovenian 161. Somali 162. Spanish \n", "163. Sundanese 164. Swahili 165. Swedish \n", "166. Syriac 167. Tagalog 168. Tajik \n", "169. Tamil 170. Tatar 171. Telugu \n", "172. Thai 173. Tibetan 174. Tigrinya \n", "175. Tonga 176. Tsonga 177. Tswana \n", "178. Tumbuka 179. Turkish 180. Turkmen \n", "181. Twi 182. Uighur 183. Ukrainian \n", "184. Urdu 185. Uzbek 186. Venda \n", "187. Vietnamese 188. Volapuk 189. Waray_philippines \n", "190. Welsh 191. Wolof 192. Xhosa \n", "193. Yiddish 194. Yoruba 195. Zhuang \n", "196. Zulu \n"
] } ], "source": [ "from polyglot.utils import pretty_list\n", "print(pretty_list(Detector.supported_languages()))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }